About the project

This is the Introduction to Open Data Science course by Kimmo Vehkalahti & co.

This course consists of six sections and every section has it´s weekly exercises.

  1. Tools and methods for open and reproducible research
  2. Regression and model validation
  3. Logistic regression
  4. Clustering and classification
  5. Dimensionality reduction techniques
  6. Final assignment

We are using GitHub to share our work. My repository you can find here.


Week 2: Regression and model validation

Reading the students2014 data into R

R-code and output

You can find the code pressing the “Code”-button on the right ->

getwd()
## [1] "/Users/mirva/IODS-project"
setwd("/Users/mirva/IODS-project/data")
learning2014 = read.csv("learning2014")
dim(learning2014)
## [1] 166   8
str(learning2014)
## 'data.frame':    166 obs. of  8 variables:
##  $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

This data is part of the data which Kimmo Vehkalahti has collected 3.12.2014 - 10.1.2015 on the course:

Johdatus yhteiskuntatilastotieteeseen, syksy 2014 (Introduction to Social Statistics, fall 2014 - in Finnish)

The data has 166 observations and 7 variables which are gender, age, attitude, deep, stra, surf and points

  • gender has two options: M (Male), F (Female)
  • age is (in years) derived from the date of birth
  • attitude describes global attitude toward statistics
  • deep, stra and suf are on the Likert-scale
  • deep refers to deep learning and is the average score of the subsets Seeking Meaning, Relating Ideas and Use of Evidence
  • stra refers to strategic learning and is the average score of the subsets Organized Studying and Time Management * surf refers to surface learning and is the average of the subsets Lack of Purpose, Unrelated Memorising and Syllabus-boundness
  • points are the exam points of the course

Graphical overview of the data

R-code and plot

library(ggplot2)
library(GGally)
overw <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), 
             lower = list(combo = wrap("facethist", bins = 20)))
overw

summary(learning2014)
##        X          gender       age           attitude          deep      
##  Min.   :  1.00   F:110   Min.   :17.00   Min.   :14.00   Min.   :1.583  
##  1st Qu.: 42.25   M: 56   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:3.333  
##  Median : 83.50           Median :22.00   Median :32.00   Median :3.667  
##  Mean   : 83.50           Mean   :25.51   Mean   :31.43   Mean   :3.680  
##  3rd Qu.:124.75           3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:4.083  
##  Max.   :166.00           Max.   :55.00   Max.   :50.00   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

The variables are all quite normally distributed except the age-variable where the min is 17, median 22 and max 55. Also the differences between men and women are quite small, allthough the distributions are not exactly equal. Biggest overall correlations are between attitude and points (r=.437) and deep and surf (negative, r=-.324), correspondingly smallest overall correlations are between deep and points (negative, r=-.0101), attitude and age (r=.0222) and deep and age (r.0251).


Choosing the variables and fitting a regression model

y=“points”, x1=“attitude”, x2=“stra” and x3=“surf”

Model 1

ggpairs(learning2014, lower = list(combo = wrap("facethist", bins = 20)))

regr1 <- lm(points ~ attitude + stra +surf, data = learning2014)
summary(regr1)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## attitude     0.33952    0.05741   5.913 1.93e-08 ***
## stra         0.85313    0.54159   1.575  0.11716    
## surf        -0.58607    0.80138  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

T-value of attitude (5.913) is statistically significant (p<.001) so the attitude-variable explains well the points-variable. T-values of stra (1.575) and surf(-0.731) aren´t significant in this model so I will remove the one with smaller t-value: the variable surf.

Model 2

regr2 <- lm(points ~ attitude + stra, data = learning2014)
summary(regr2)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97290    2.39591   3.745  0.00025 ***
## attitude     0.34658    0.05652   6.132 6.31e-09 ***
## stra         0.91365    0.53447   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

The p-value of the variable stra (p=.08927) remains still >.05 so i will also remove it from the model.

Model 3

regr3 <- lm(points ~ attitude, data = learning2014)
summary(regr3)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Interpretation of the model

The relationship between attitude and points: when y = “attitude” and x = “points” we can predict the value of y with model:

y = 11.63715 + 0.35255x + e

Multiple R-squared of this model is 0.1906, so the model explains 19,06% of the variation of attitude


Diagnostic plots

par(mfrow = c(2,2))
plot(regr3, which = c(1,2,5))

Interpretation of the plots

  • Residuals vs Fitted: Residuals are quite randomly around the 0 line so it could indicate that the residuals and fitted values are not correlated.
  • Normal QQ-plot: At most part the Q-Q plots follow the theoretical line quite nicely and are therefore normally distributed, but at the beginning and in the end of the line the plots aren´t so well on the line which indicates that there is light left skewness
  • Residuals vs Leverage: There seems to be three outliers (35, 56 and 71) which could influence the model so the next step would be to delete them and see how it affects R-squared

Week 3: Logistic regression

library(ggplot2); library(dplyr); library(GGally); library(boot)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Reading the data into R

At first I´ll read the alcohol data into R

getwd()
## [1] "/Users/mirva/IODS-project"
setwd("/Users/mirva/IODS-project/data")
alc = read.csv("alcohol")

The structure of the data

str(alc)
## 'data.frame':    382 obs. of  36 variables:
##  $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  5 3 8 1 2 8 0 4 0 0 ...
##  $ G1        : int  2 7 10 14 8 14 12 8 16 13 ...
##  $ G2        : int  8 8 10 14 12 14 12 9 17 14 ...
##  $ G3        : int  8 8 11 14 12 14 12 10 18 14 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...

This data includes 382 observations and 36 variables
The data are from two identical questionaires related to secondary school student alcohol comsumption in Portugal.

Database info

Using Data Mining To Predict Secondary School Student Alcohol Consumption.
Fabio Pagnotta, Hossain Mohammad Amran.
Department of Computer Science,University of Camerino
You can find more info about the variables and data here

The hypotheses

I will study the relationship between high/low alcohol consumption and four other variables: Pstatus, Medu, Fedu and famrel

My personal hypothesis about these four variables and alcohol consumption are as follows:
Pstatus = parents´ cohabitation status (living together or apart): My hypothesis is that students whose parents are living together could have lower risk of high alcohol consumption. Medu and Fedu = mother’s education and father´s education: My hypothesis is that parents´lower education (and therefore possibly lower socio economical status) could increase the risk of high alcohol consumption.
famrel = quality of family relationships: My hypothesis here is that students whose family relationships are graded 4 or 5 could have lower risk of high alcohol consumption.

Exploring the distributions

Parents´cohabilitation status

alc %>% group_by(Pstatus, high_use) %>% summarise(count = n())
## Source: local data frame [4 x 3]
## Groups: Pstatus [?]
## 
##   Pstatus high_use count
##    <fctr>    <lgl> <int>
## 1       A    FALSE    26
## 2       A     TRUE    12
## 3       T    FALSE   242
## 4       T     TRUE   102

Within students whose parents´cohabilitation status is A = “apart” 26 (68%) students´ alcohol usage is low and 12 (32%) it is high. Within students whose parents live together (status = T) 242 (70%) students´ alcohol usage is low and 102 (30%) it is high.

Mother´s education level

alc %>% group_by(Medu, high_use) %>% summarise(count = n())
## Source: local data frame [10 x 3]
## Groups: Medu [?]
## 
##     Medu high_use count
##    <int>    <lgl> <int>
## 1      0    FALSE     1
## 2      0     TRUE     2
## 3      1    FALSE    33
## 4      1     TRUE    18
## 5      2    FALSE    80
## 6      2     TRUE    18
## 7      3    FALSE    59
## 8      3     TRUE    36
## 9      4    FALSE    95
## 10     4     TRUE    40
g1me <- ggplot(alc, aes(x = high_use, y = Medu))
g1me + geom_boxplot() + xlab("High alcohol consumption") + ylab("Mothers education")

If we look at the mothers education level the results of summary are as follows:

  • No education: low drinking n=1 (33%) and high drinking n=2 (67%)
  • Primary education (4th grade): low drinking n=33 (65%) and high drinking n= 18 (35%)
  • 5th to 9th grade: low drinking n=80 (82%) and high drinking n=18 (18%)
  • Secondary education: low drinking n=59 (62%) and high drinking n=36 (38%)
  • Higher education: low drinking n=95 (70%) and high drinking n=40 (30%)

So there might be some differences between these groups but ns are quite low, especially in non-educated-group.
The boxplots of these two alcohol consumption groups look exactly the same.

Father´s education level

alc %>% group_by(Fedu, high_use) %>% summarise(count = n())
## Source: local data frame [9 x 3]
## Groups: Fedu [?]
## 
##    Fedu high_use count
##   <int>    <lgl> <int>
## 1     0    FALSE     2
## 2     1    FALSE    53
## 3     1     TRUE    24
## 4     2    FALSE    75
## 5     2     TRUE    30
## 6     3    FALSE    72
## 7     3     TRUE    27
## 8     4    FALSE    66
## 9     4     TRUE    33
g1fe <- ggplot(alc, aes(x = high_use, y = Fedu))
g1fe + geom_boxplot() + xlab("High alcohol consumption") + ylab("Fathers education")

With fathers´education the same numbers are as follows:

  • no education: low drinking n=2 (100%) and high drinking n=0
  • primary education (4th grade): low drinking n=53 (69%) and high drinking n= 24 (31%)
  • 5th to 9th grade: low drinking n=75 (71%) and high drinking n=30 (29%)
  • secondary education: low drinking n=72 (73%) and high drinking n=27 (27%)
  • higher education: low drinking n=66 (67%) and high drinking n=33 (33%)

So also here we can see that fathers´higher education doesn´t possibly imply lower drinking percentages as opposed to my hypothesis.

Quality of family relations

alc %>% group_by(famrel, high_use) %>% summarise(count = n())
## Source: local data frame [10 x 3]
## Groups: famrel [?]
## 
##    famrel high_use count
##     <int>    <lgl> <int>
## 1       1    FALSE     6
## 2       1     TRUE     2
## 3       2    FALSE    10
## 4       2     TRUE     9
## 5       3    FALSE    39
## 6       3     TRUE    25
## 7       4    FALSE   135
## 8       4     TRUE    54
## 9       5    FALSE    78
## 10      5     TRUE    24
g1fr <- ggplot(alc, aes(x = high_use, y = Fedu))
g1fr + geom_boxplot() + xlab("High alcohol consumption") + ylab("Quality of family relations")

If we look at the connection between alcohol consumption and quality of family relations distribution in different grade groups are:

  • grade 1: low drinkers 6 (75%), high drinkers 2 (25%)
  • grade 2: low drinkers 10 (53%), high drinkers 9 (47%)
  • grade 3: low drinkers 39 (61%), high drinkers 25 (39%)
  • grade 4: low drinkers 135 (71%), high drinkers 54 (29%)
  • grade 5: low drinkers 78 (76%), high drinkers 24 (24%)

So there might be some support for my hypothesis that better the family relations lower the alcohol consumption

Logistic regression and the relationship between the chosen variables and high/low alcohol consumption

m <- glm(high_use ~ Pstatus + Medu + Fedu + famrel, data = alc, family = "binomial")
summary.glm(m)
## 
## Call:
## glm(formula = high_use ~ Pstatus + Medu + Fedu + famrel, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1682  -0.8475  -0.7827   1.4193   1.7343  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.10331    0.66634   0.155   0.8768  
## PstatusT    -0.04070    0.37567  -0.108   0.9137  
## Medu        -0.01669    0.13690  -0.122   0.9029  
## Fedu         0.06193    0.13545   0.457   0.6475  
## famrel      -0.26540    0.12146  -2.185   0.0289 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 460.64  on 377  degrees of freedom
## AIC: 470.64
## 
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept)    PstatusT        Medu        Fedu      famrel 
##  0.10331492 -0.04070326 -0.01669495  0.06193416 -0.26540413

Only the variable quality of family relations is significant in this model.

Odds ratios and confidence intervals

OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
##                    OR     2.5 %    97.5 %
## (Intercept) 1.1088405 0.2964832 4.0825324
## PstatusT    0.9601140 0.4688583 2.0693320
## Medu        0.9834436 0.7515044 1.2871504
## Fedu        1.0638923 0.8164830 1.3904537
## famrel      0.7668960 0.6034634 0.9732521

Every one of these confidence intervals includes 1 except famrel so it implies these other variables make no difference in the model.

Removing non-significant variables one by one doesn´t give any new variables to the model

m2 <- glm(high_use ~ Medu + Fedu + famrel, data = alc, family = "binomial")
summary.glm(m2)
## 
## Call:
## glm(formula = high_use ~ Medu + Fedu + famrel, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1715  -0.8458  -0.7703   1.4164   1.7327  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.06539    0.56687   0.115   0.9082  
## Medu        -0.01521    0.13618  -0.112   0.9111  
## Fedu         0.06190    0.13546   0.457   0.6477  
## famrel      -0.26612    0.12130  -2.194   0.0282 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 460.65  on 378  degrees of freedom
## AIC: 468.65
## 
## Number of Fisher Scoring iterations: 4
m3 <- glm(high_use ~ Fedu + famrel, data = alc, family = "binomial")
summary.glm(m3)
## 
## Call:
## glm(formula = high_use ~ Fedu + famrel, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1732  -0.8400  -0.7666   1.4146   1.7244  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.04792    0.54493   0.088   0.9299  
## Fedu         0.05207    0.10288   0.506   0.6128  
## famrel      -0.26610    0.12129  -2.194   0.0282 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 460.67  on 379  degrees of freedom
## AIC: 466.67
## 
## Number of Fisher Scoring iterations: 4

So I´ll leave just the variable famrel to the model

m4 <- glm(high_use ~ famrel, data = alc, family = "binomial")
summary.glm(m4)
## 
## Call:
## glm(formula = high_use ~ famrel, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1404  -0.8323  -0.7427   1.4483   1.6869  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   0.1772     0.4812   0.368   0.7127  
## famrel       -0.2649     0.1212  -2.185   0.0289 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 460.92  on 380  degrees of freedom
## AIC: 464.92
## 
## Number of Fisher Scoring iterations: 4

The predictive power of my model

probabilities <- predict(m4, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE
##    FALSE   268
##    TRUE    114
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g2 <- g + geom_point()
g2

table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use     FALSE       Sum
##    FALSE 0.7015707 0.7015707
##    TRUE  0.2984293 0.2984293
##    Sum   1.0000000 1.0000000
m <- glm(high_use ~ failures + absences + sex, data = alc, family = "binomial")

So this means that the predictive power of my model is low (and probably more significant variables should be found to make it more predictive.) As long as all the predictions are under 0.5 the level of prediction is same with guessing.

Mean prediction error

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2984293

Mean prediction error is approx. 0.298 which means that so many of predictions are false.

10-fold cross validation

cv <- cv.glm(data = alc, cost = loss_func, glmfit = m4, K = 10)
cv$delta[1]
## [1] 0.3036649

The average number of wrong predictions in the cross validation is approx. 0.304

An example of a better model

Another model where prediction error is smaller compared to the model introduced in DataCamp has variables: absences, sex, famrel and goout. It´s prediction error is about 0.207

m5 <- glm(high_use ~ absences + sex + famrel + goout, data = alc, family = "binomial")
summary.glm(m5)
## 
## Call:
## glm(formula = high_use ~ absences + sex + famrel + goout, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7151  -0.7820  -0.5137   0.7537   2.5463  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.76826    0.66170  -4.184 2.87e-05 ***
## absences     0.08168    0.02200   3.713 0.000205 ***
## sexM         1.01234    0.25895   3.909 9.25e-05 ***
## famrel      -0.39378    0.14035  -2.806 0.005020 ** 
## goout        0.76761    0.12316   6.232 4.59e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 379.81  on 377  degrees of freedom
## AIC: 389.81
## 
## Number of Fisher Scoring iterations: 4
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m5, K = 10)
cv$delta[1]
## [1] 0.2146597

Week 4: Clustering and classification

Loading the Boston data from the MASS package

setwd("/Users/mirva/IODS-project/")
library(MASS); library(corrplot); library(dplyr); library(ggplot2); library(GGally)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")

The structure and the dimensions of the data

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

This data includes 506 observations and 14 variables which are: crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat and medv. You can find the descriptions of the variables here

A graphical overview of the data

Drawing bar plots of the distributions of these variables isn´t very useful so compair them with pairs and corrplot

pairs(Boston)

As you can see the connections between these variables vary; some connections are linear (t.ex. between age and dis) and some are non-linear (t.ex between lstat and medv) and some don´t seem to be connected at all (t.ex. between black and rm )

An easier way to look at the correlations is correlation plot

cor_matrix<-cor(Boston) 
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

As you can see here, the colors and shades describe the direction and the strength of the correlation between variables - the stronger the shade the stronger the correlation. There seems to be very strong positive correlation t.ex. between rad and tax and strong negative correlation t.ex. between age and dis. On the other hand there seems to be very weak negative correlation between t.ex. ptratio and black, zn and crim and rm and rad.

From correlation matrix we can see correlation coefficients between these variables.

cor_matrix %>% round(digits = 2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00

As you can see t.ex. the strong positive correlation between rad and tax, which we saw earlier in the correlation plot, has coefficient r=.91 and weak negative correlation between ptratio and black has coefficient r=-.18.

Standardizing the dataset

I´ll standardize (scaled(x)=(x−mean(x))/sd(x)) the Boston dataset using scale() function because linear discrimination analysis is based on assumptions that variables are normally distributed and their variances are same.

boston_scaled <- scale(Boston)
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

If we compare the summaries of Boston and boston_scaled we can see that the mean of every variable changed to 0 (so the variables have also negative values now) and also the range is much smaller.

Creating a categorial variable

At first I´ll convert the boston_scaled to a data frame format.

class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)

Then I´ll save the scaled crime rate

scaled_crim <- boston_scaled$crim

After that I`ll use the quantiles as the break points in the categorical variable and create a categorial variable “crime”

bins <- quantile(scaled_crim)
crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

I´ll remove original crim from the dataset and add the “crime” to scaled data

boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)

Dividing the dataset to train and test sets

Now I´ll divide the dataset to train and test sets, so that 80% of the data belongs to the train set

n <- nrow(boston_scaled)
ind <- sample(n,  size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]

Fitting the linear discriminant analysis on the train set

Now I´ll fit a linear discriminant analysis with the function lda() and use the “crime” as target variable and other variables as predictors.

lda.fit <- lda(crime ~ ., data = train)

The LDA biplot

Here is the LDA biplot of this analysis

classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)

Saving the crime categories and removing the categorical crime variable

To save the crime categories from the test set I´ll create “correct_classes”

correct_classes <- test$crime

After that I´ll remove the categorical crime variable from the test dataset.

test <- dplyr::select(test, -crime)

Predicting the classes with the LDA model

Next I will predict the classes with the LDA model on the test data using function predict()

lda.pred <- predict(lda.fit, newdata = test)

Then I´ll cross tabulate the results

table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       16       7        2    0
##   med_low    5      14        3    0
##   med_high   0      12       16    1
##   high       0       0        1   25

So as you can see the predicted variables are columns and correct values are rows. If we look at first at the row where the correct value is low we can see that the most of the predicted values are in the class low (n=16) but there are also many values in the class med_low(12) and even some in the med_high (n=2) class. We can see same with the correct med_low class: most of the predicted values are on the med_low class (n=15) but almost as many are in the med_high group (n=14) and some in the low group (n=3). The situation with correct med_high and high classes is much better. In the correct med_high class there are 18 values in the predicted med_high class and only 2 in high class. In correct high class every value (n=20) is in the predicted high class.

Distance measures

At first I´ll reload the Boston data

data("Boston")

After that I´ll scale the variables to get comparable distances as I did earlier in this exercise

boston_scaled <- scale(Boston)

Then I´ll calculate the distances between the observations by creating an euclidean distance matrix

dist_eu <- dist(Boston)

K-means

Now I´ll first run the k-means algorithm where the number of clusters is 10

km <-kmeans(dist_eu, centers = 10)
pairs(Boston, col = km$cluster)

The plot seems to be quite messy so maybe there are too many clusters

To investigate the optimal number of clusters I will calculate the total within sum of squares

set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
plot(1:k_max, twcss, type='b')

If we look at the plot we can see that the TWCSS drops radically when the number of clusters is two so that might be the optimal number of them.

I will run the k-means algorithm again

km <-kmeans(dist_eu, centers = 2)

Now if we draw the plot again we can see that they are much easier to interpret

pairs(Boston, col = km$cluster)

As we can see in many cases here (t.ex. between age and indus) the red clusters seem to be quite random but the black clusters seem to have some clear pattern.


Week 5: Dimensionality reduction techniques

This weeks data is part of the UN Development Programme, Human Development Reports, which you can find here The data is combined from two different data sets: Human Development Index and its components and Gender Inequality Index

At first I will load the ‘human’ data and then explore it´s structure and dimensions.

Loading the data

setwd("/Users/mirva/IODS-project/data")
human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", sep=",", header=TRUE)

Structure and dimensions

str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human)
## [1] 155   8

As you can see there are 155 observations and 8 variables in this data. The variables are as follows: * Edu2.FM = The ratio of female and male populations with secondary education in each country (female/ male) * Labo.FM = The ratio of labour force participation of females and males in each country (female/male) * Life.Exp = Life expectancy at birth (years) * Edu.Exp = Expected years of schooling * GNI = Gross national income (GNI) per capita * Mat.Mor = Maternal mortality ratio (deaths per 100 000 live births) * Ado.Birth = Adolescent birth rate (births per 1 000 women ages 15–19) * Parli.F = Share of seats in parliament (% held by women)

Now that you know what the variables used here are let´s look at the graphics and summaries of the data.

The graphical overview of the data

library(GGally); library(dplyr); library(corrplot)
ggpairs(human)

cor(human) %>% corrplot()

As you can see from the correlation plots above some correlations are quite strong and some are very minimal. Strongest correlations we can find are between Mat.Mor and Life.Exp (-.857), Edu.Exp and Life.Exp (r=.789), Mat.Mor and Ado.Birth (r=.759), Life.Exp and GNI (r=.627) and Edu.Exp and GNI (r=.624). Weakest correlations are between Edu2.FM and Labo.FM (r=.00956), Labo.FM and GNI (r=-.0217), Labo.FM and Edu.Exp (r=.0473), Ado.Birth and Parli.F (r=-.0709) and Parli.F and Edu2.FM (r=.0786).

Summaries of the variables in the data

summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

As we can see from the summaries and also from the upper of the correlation plots above the median of the Edu2.FM is around 0.9375 so the ratio of female and male populations with secondary education is almost 1 meaning almost equal ratio of education. But in some country the minimum is 0.1717 meaning that the women get 2nd education rarely compared to men. The maximum here is 1.4967 meaning that in some countries women are more educated than men.

If we look at the summary of the Labo.FM median is 0.7535 so the men are in labour force more often than women. In many countries women are at home taking care of household and it might explain something of this result. It is interesting that the maximum here is only 1.0380 which tells that women in average don´t work more often than men.

Edu.Exp seems to be quite normally distributed and the median here is 13.50 (expected years of schooling). The 1st and 3rd quartiles are quite near this but the minimum is only 5.40 telling that some of people won´t probably go to school at all or just for few years. Maximum 20.20 years is also quite high meaning that in some countries almost everyone gets at least 2nd education and even higher.

The median of Life.Exp is 74.20 years. Quite shockingly the minimum here is only 49 years which might be a consequense for example of high child mortality. The maximum is 83.50 years so in some country the life expectancy is almost two times larger than in the country where it is 49 years.
As we can see in GNI the range is huge. The minimum is only 581$ and the maximum 123 124$ while the median is 12 040$. We can see from the graph also that GNI is far from uniform (or even normal) distribution and therefore is quite inequal.

Almost the same distribution is true with the Mat.Mor but here the minimum is 1 death, median 49 deaths and maximum 1100 deaths and 50% is between 11.5 and 190 deaths. So this means that in some country there is only 1 death per 100 000 life births (0.001%) and in some country 1100 deaths (1.1%) so the differences are very big but in most countries the death rate is quite low.

The median of the Ado.Birth is surplisingly high, 33.60 which is 3.36% of womean ages 15-19. Also the range is quite big; minimum is 0.60 and maximum 204.80.

Parli.F tells much of the equality between the genders. The minimum here is 0 meaning that in some country there are no women in parliament. The maximum is 57.50 so in some country there are more women in parliament than men. The median is 19.30 so in most countries women are under represented in parliaments compared to men.

Principal component analysis (PCA)

Because there are 8 variables it would be hard to interpret the relations between them if we could use only pairwise comparisons. That´s why I will use principal component analysis (PCA) which reduces variables to combinations of the data. I will perform the PCA first on the non standardized human data and then on the standardized human data

PCA on non standardized data

pca1_human <- prcomp(human)
s1 <- summary(pca1_human)
s1
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000
pca1_pr <- round(100*s1$importance[2,], digits = 1) 
pca1_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
## 100   0   0   0   0   0   0   0
pc1_lab <- paste0(names(pca1_pr), " (", pca1_pr, "%)")
biplot(pca1_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc1_lab[1], ylab = pc1_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

PCA on standardized data

human_std <- scale(human)
summary(human_std)
##     Edu2.FM           Labo.FM           Edu.Exp           Life.Exp      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
pca2_human <- prcomp(human_std)
s2 <- summary(pca2_human)
s2
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000
pca2_pr <- round(100*s2$importance[2,], digits = 1) 
pca2_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
pc2_lab <- paste0(names(pca2_pr), " (", pca2_pr, "%)")
biplot(pca2_human, cex = c(0.8, 0.7), col = c("grey40", "deeppink2"), xlab = pc2_lab[1], ylab = pc2_lab[2])

As you can see the plots created are very different. If we look at first at the PCA on non standardized data we can see that the first component PC1 explains 99,99% of the variation and PC2 0,01% so as we can see from the cumulative proportion together they explain 100% of the variation. Because there are different units of measure in the data variances are quite different and the PCA is really hard to interpret on non standardized data.

If we then look at the PCA on standardized data we can see that there are more components with some proportions and 100% is acquired not until PC8. PC1 and PC2 together explain 69,84% of the variaton. Also the plot is now easier to interpret. If we look at the arrows we can see that there are quite strong correlations between Edu.Exp, Life.Exp, Edu2.FM and GNI and also between Mat.Mor and Ado.Birth as we already noticed earlier. These 6 variables also correlate strongly with PC1. Parli.F and Labo.FM correlate most with each other and PC2. We can also see that for example in Mozambique and Burundi the ratio of labour force participation of females and males in each country and share of seats in parliament are quite high and also the maternal mortality ratio and adolescent birth ratio. In Mozambique and Burundi the ratio of female and male populations with secondary education, expected years of schooling, life expectancy at birth and gross national income (GNI) per capita are quite low. For example Iran has quite opposite pattern.

Tea!

At first I will load the tea dataset from the FactoMineR package

library(FactoMineR); library(tidyr); library(ggplot2)
data("tea")
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36

There are 300 observations and 36 variables in this data.

I choose to keep the following columns: tea.time, tearoom, Tea, How, sugar, how and where.

keep_columns <- c("tea.time", "tearoom", "Tea", "How", "sugar", "how", "where")
tea_time <- dplyr::select(tea, one_of(keep_columns))
summary(tea_time)
##          tea.time          tearoom           Tea         How     
##  Not.tea time:131   Not.tearoom:242   black    : 74   alone:195  
##  tea time    :169   tearoom    : 58   Earl Grey:193   lemon: 33  
##                                       green    : 33   milk : 63  
##                                                       other:  9  
##       sugar                     how                       where    
##  No.sugar:155   tea bag           :170   chain store         :192  
##  sugar   :145   tea bag+unpackaged: 94   chain store+tea shop: 78  
##                 unpackaged        : 36   tea shop            : 30  
## 
str(tea_time)
## 'data.frame':    300 obs. of  7 variables:
##  $ tea.time: Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ Tea     : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How     : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar   : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how     : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where   : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...

As you can see from the summary there are the following options in the chosen variables: * tea.time: not tea time or tea time * tearoom: not in a tearoom or in a tearoom * Tea: black, earl grey or green * How: alone, with lemon, milk or other * sugar: no sugar or with sugar * how: with a tea bag, with a tea bag and loose leaf (unpackaged) or only loose leaf * where: from a chain store, from a chain store and a tea shop or only from a tea shop

gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables; they will
## be dropped

From the barplots we can see that most of the people have tea time, don´t drink at tearoom, drink earl grey, drink tea alone without any lemon or milk, use sugar, use only tea bags and buy their tea from a chain store.

Multiple Correspondence Analysis

I will use Multiple Correspondence Analysis (MCA) to inspect the connections between the variables I chose

mca <- MCA(tea_time, graph = FALSE)
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.280   0.233   0.185   0.160   0.148   0.141
## % of var.             16.335  13.580  10.802   9.327   8.633   8.213
## Cumulative % of var.  16.335  29.915  40.717  50.044  58.677  66.890
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.131   0.122   0.101   0.089   0.072   0.053
## % of var.              7.635   7.100   5.880   5.205   4.191   3.099
## Cumulative % of var.  74.525  81.625  87.505  92.711  96.901 100.000
## 
## Individuals (the 10 first)
##                 Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1            | -0.525  0.328  0.256 |  0.021  0.001  0.000 | -0.242  0.105
## 2            | -0.353  0.148  0.082 | -0.076  0.008  0.004 | -0.622  0.696
## 3            | -0.295  0.104  0.140 | -0.148  0.031  0.035 | -0.284  0.145
## 4            | -0.680  0.551  0.645 | -0.105  0.016  0.015 |  0.230  0.095
## 5            | -0.537  0.344  0.414 | -0.040  0.002  0.002 | -0.186  0.062
## 6            | -0.537  0.344  0.414 | -0.040  0.002  0.002 | -0.186  0.062
## 7            | -0.295  0.104  0.140 | -0.148  0.031  0.035 | -0.284  0.145
## 8            | -0.111  0.015  0.009 | -0.183  0.048  0.023 | -0.720  0.933
## 9            |  0.567  0.383  0.199 | -0.472  0.319  0.138 |  0.093  0.015
## 10           |  0.907  0.980  0.384 | -0.147  0.031  0.010 | -0.346  0.215
##                cos2  
## 1             0.054 |
## 2             0.255 |
## 3             0.129 |
## 4             0.074 |
## 5             0.050 |
## 6             0.050 |
## 7             0.129 |
## 8             0.359 |
## 9             0.005 |
## 10            0.056 |
## 
## Categories (the 10 first)
##                  Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## Not.tea time |  -0.505   5.684   0.198  -7.690 |   0.205   1.121   0.032
## tea time     |   0.392   4.406   0.198   7.690 |  -0.159   0.869   0.032
## Not.tearoom  |  -0.327   4.392   0.445 -11.538 |   0.046   0.104   0.009
## tearoom      |   1.363  18.324   0.445  11.538 |  -0.191   0.434   0.009
## black        |   0.462   2.684   0.070   4.570 |   0.178   0.479   0.010
## Earl Grey    |  -0.114   0.425   0.023  -2.643 |  -0.248   2.423   0.111
## green        |  -0.370   0.769   0.017  -2.250 |   1.050   7.443   0.136
## alone        |  -0.157   0.819   0.046  -3.704 |   0.146   0.849   0.040
## lemon        |   0.516   1.492   0.033   3.134 |   0.259   0.453   0.008
## milk         |  -0.049   0.026   0.001  -0.439 |  -0.401   2.069   0.043
##               v.test     Dim.3     ctr    cos2  v.test  
## Not.tea time   3.114 |   0.167   0.934   0.021   2.535 |
## tea time      -3.114 |  -0.129   0.724   0.021  -2.535 |
## Not.tearoom    1.620 |   0.016   0.016   0.001   0.572 |
## tearoom       -1.620 |  -0.068   0.068   0.001  -0.572 |
## black          1.760 |  -0.986  18.487   0.318  -9.753 |
## Earl Grey     -5.754 |   0.437   9.462   0.344  10.140 |
## green          6.383 |  -0.343   1.000   0.015  -2.087 |
## alone          3.438 |  -0.185   1.723   0.064  -4.368 |
## lemon          1.575 |   1.622  22.318   0.325   9.859 |
## milk          -3.572 |  -0.076   0.092   0.002  -0.674 |
## 
## Categorical variables (eta2)
##                Dim.1 Dim.2 Dim.3  
## tea.time     | 0.198 0.032 0.021 |
## tearoom      | 0.445 0.009 0.001 |
## Tea          | 0.076 0.169 0.375 |
## How          | 0.150 0.106 0.372 |
## sugar        | 0.070 0.012 0.393 |
## how          | 0.462 0.610 0.028 |
## where        | 0.560 0.692 0.106 |

Here we can see that the dimensions 1 and 2 cover 29,915% of the variance. From the last table we can see how well different variables correlate with the first three dimensions.

plot(mca, invisible=c("ind"), habillage = "quali")

From the plot above we can see for example that buying unpackaged tea is closely related to buying tea from a tea shop. Also buying only tea bags is closely related to buying tea from a chain store. Drinking earl grey and drinking tea with lemon are closely related and also drinking black tea and drinking tea with milk.